Designing a Multimodal Spoken Component of the Australian National Corpus

Author

  • Michael Haugh
Abstract

Spoken language and interaction lie at the core of human experience. The primary medium of communication is speech, with some estimating the ratio of spoken to written language to be as high as 90% to 10% (Cermák, 2009, p. 115). Yet spoken language and interaction have remained poor cousins in the building of corpora to date. Not only are spoken corpora much smaller than written corpora (Xiao, 2008), but the overwhelming focus in the analysis of spoken corpora has also been on textual transcriptions of audio recordings, with the original recordings themselves generally not being widely available (Wichmann, 2008, p. 189). In the most comprehensive, large-scale national corpus to have included a spoken component to date, the British National Corpus, for instance, the ratio of spoken to written language is the inverse of that estimated to reflect actual communicative practice. Moreover, as the original sound recordings are not (widely) available, researchers are limited to analysing textual representations of spoken interaction. As a result of these constraints, work on spoken corpora has largely focused on the analysis of lexical and grammatical features of spoken language (Wichmann, 2008, p. 189). However, as Adolphs and Carter (2007) have recently argued, “while current corpora allow us to explore multimillion word databases, they fail to represent language and communication beyond the word. This is problematic as social interactions are in fact multimodal, combining both verbal and non-verbal elements” (p. 133). The increasing recognition that language needs to be studied in situ necessitates the building of multimodal corpora that allow such analyses to be undertaken (Allwood, 2008, p. 223).
Yet while building large spoken corpora that are (at least partially) multimodal appears to be a way forward in redressing the relative neglect of spoken language in corpora to date, such endeavours are likely to be enormously time-consuming and expensive if the myriad challenges facing those who wish to build such corpora are not carefully unpacked in the initial stages of design. The aim of this paper is thus to consider some of the main challenges involved in designing an Australian National Corpus (AusNC) that is multimodal. The paper begins by outlining what constitutes a multimodal corpus and by drawing a distinction between multimodal text corpora and multimodal spoken corpora, the latter of which is the primary focus of this paper. The case for favouring a multimodal spoken component of the AusNC over traditional approaches to spoken corpora is then outlined. Some of the key challenges that arise in designing a multimodal spoken corpus are next explored. In light of such a complex array of challenges, it is concluded that the principles outlined in Agile Corpus Creation theory (Voormann & Gut, 2008) constitute the most pragmatic way forward in designing and building a multimodal spoken component of the AusNC.

Similar resources

Developing a Quality Spoken Component of the Australian National Corpus

The creation of a quality spoken component of the Australian National Corpus (AusNC) will allow us to deepen our understandings of Australian English (AusE) and to open up new areas of analysis. To make the most of this opportunity we contend that not only must the data be of high quality but that the corpus must also be constructed in such a way that the data is of maximal use to researchers w...

Work on Spoken (Multimodal) Language Corpora in South Africa

This paper describes past, ongoing and planned work on the collection and transcription of spoken language samples for all the South African official languages and as part of this the training of researchers in corpus linguistic research skills. More specifically the work has involved (and still involves) establishing an international corpus linguistic network linked to a network hub at a UNISA...

Towards the Design of the Australian National Corpus

Corpora are becoming more and more important as a research tool for linguists as they are large collections of authentic text. However, not every researcher has the time and resources to compile their own corpus. Large corpora in the world such as the BNC, the ANC or the International Corpus of English (ICE) have been widely used for research on the English language in general or an English dia...

Introduction: Compiling and analysing the Spoken British National Corpus 2014

For over twenty years, the British National Corpus has been one of the most widely known and used corpora. It is almost impossible to attend an international corpus linguistics conference such as Corpus Linguistics, ICAME (International Computer Archive of Modern and Medieval English), AACL (American Association for Corpus Linguistics) or APCLC (Asia Pacific Corpus Linguistics Conference) witho...

A Multimodal Corpus of Rapid Dialogue Games

This paper presents a multimodal corpus of spoken human-human dialogues collected as participants played a series of Rapid Dialogue Games (RDGs). The corpus consists of a collection of about 11 hours of spoken audio, video, and Microsoft Kinect data taken from 384 game interactions (dialogues). The games used for collecting the corpus required participants to give verbal descriptions of linguis...

Publication date: 2009